ggplot part 2At the end of this exercise, you will be able to:
1. Practice applying the syntax of building plots using ggplot2.
2. Build a boxplot using ggplot2.
3. Customize labels on axes using labs and themes.
4. Use color, fill, and group to customize plots and improve overall aesthetics.
At this point, we’ve plotted a few things and have a high-level understanding of how R can read in and summarize data. It is OK if you need to go back through past work and find bits of code that work for you, but try and force yourself to originate new chunks.
library(tidyverse)
library(naniar)
library(janitor)
library(here)
library(skimr)
library(janitor)
library(here)
library(palmerpenguins)
Now that you have been introduced to ggplot, let’s practice a few more plot types. Remember that plots are built in layers: plot= data + geom_ + aesthetics. We have to specify each of these in order for a plot to be produced. If you get stuck, it is often helpful to stop and make a quick sketch of what you want or expect to see on a piece of scratch paper.
Let’s review using the penguins data. First, get an idea of the structure: Are there NA’s? Are the variables discrete, categorical, or continuous?
penguins
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torge… 39.1 18.7 181 3750
## 2 Adelie Torge… 39.5 17.4 186 3800
## 3 Adelie Torge… 40.3 18 195 3250
## 4 Adelie Torge… NA NA NA NA
## 5 Adelie Torge… 36.7 19.3 193 3450
## 6 Adelie Torge… 39.3 20.6 190 3650
## 7 Adelie Torge… 38.9 17.8 181 3625
## 8 Adelie Torge… 39.2 19.6 195 4675
## 9 Adelie Torge… 34.1 18.1 193 3475
## 10 Adelie Torge… 42 20.2 190 4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
To start, let’s determine how many penguins are on an island in this data.
penguins %>% count(island, species, sort = F)
## # A tibble: 5 x 3
## island species n
## <fct> <fct> <int>
## 1 Biscoe Adelie 44
## 2 Biscoe Gentoo 124
## 3 Dream Adelie 56
## 4 Dream Chinstrap 68
## 5 Torgersen Adelie 52
What if we wanted a plot that showed the number of measured penguins per species and the number of measured penguins per island?
penguins %>%
group_by(island) %>%
ggplot(aes(x=species))+
geom_bar()
How about average bill length by sex?
penguins %>%
filter(sex!="NA") %>%
group_by(sex) %>%
summarise(ave_bill_length=mean(bill_length_mm))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## sex ave_bill_length
## <fct> <dbl>
## 1 female 42.1
## 2 male 45.9
penguins %>%
filter(sex!="NA") %>%
group_by(sex) %>%
summarise(ave_bill_length=mean(bill_length_mm)) %>%
ggplot(aes(x=sex, y=ave_bill_length)) +
geom_col()
## `summarise()` ungrouping output (override with `.groups` argument)
Database of vertebrate home range sizes.
Reference: Tamburello N, Cote IM, Dulvy NK (2015) Energy and the scaling of animal space use. The American Naturalist 186(2):196-211. http://dx.doi.org/10.1086/682070.
Data: http://datadryad.org/resource/doi:10.5061/dryad.q5j65/1
homerange <- read_csv("Tamburelloetal_HomeRangeDatabase.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## mean.mass.g = col_double(),
## log10.mass = col_double(),
## mean.hra.m2 = col_double(),
## log10.hra = col_double(),
## preymass = col_double(),
## log10.preymass = col_double(),
## PPMR = col_double()
## )
## See spec(...) for full column specifications.
homerange data? Does it have any NA’s? Do a quick exploratory analysis of your choice below.glimpse(homerange)
## Rows: 569
## Columns: 24
## $ taxon <chr> "lake fishes", "river fishes", "river fish…
## $ common.name <chr> "american eel", "blacktail redhorse", "cen…
## $ class <chr> "actinopterygii", "actinopterygii", "actin…
## $ order <chr> "anguilliformes", "cypriniformes", "cyprin…
## $ family <chr> "anguillidae", "catostomidae", "cyprinidae…
## $ genus <chr> "anguilla", "moxostoma", "campostoma", "cl…
## $ species <chr> "rostrata", "poecilura", "anomalum", "fund…
## $ primarymethod <chr> "telemetry", "mark-recapture", "mark-recap…
## $ N <chr> "16", NA, "20", "26", "17", "5", "2", "2",…
## $ mean.mass.g <dbl> 887.00, 562.00, 34.00, 4.00, 4.00, 3525.00…
## $ log10.mass <dbl> 2.9479236, 2.7497363, 1.5314789, 0.6020600…
## $ alternative.mass.reference <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ mean.hra.m2 <dbl> 282750.00, 282.10, 116.11, 125.50, 87.10, …
## $ log10.hra <dbl> 5.4514026, 2.4504031, 2.0648696, 2.0986437…
## $ hra.reference <chr> "Minns, C. K. 1995. Allometry of home rang…
## $ realm <chr> "aquatic", "aquatic", "aquatic", "aquatic"…
## $ thermoregulation <chr> "ectotherm", "ectotherm", "ectotherm", "ec…
## $ locomotion <chr> "swimming", "swimming", "swimming", "swimm…
## $ trophic.guild <chr> "carnivore", "carnivore", "carnivore", "ca…
## $ dimension <chr> "3D", "2D", "2D", "2D", "2D", "2D", "2D", …
## $ preymass <dbl> NA, NA, NA, NA, NA, NA, 1.39, NA, NA, NA, …
## $ log10.preymass <dbl> NA, NA, NA, NA, NA, NA, 0.1430148, NA, NA,…
## $ PPMR <dbl> NA, NA, NA, NA, NA, NA, 530, NA, NA, NA, N…
## $ prey.size.reference <chr> NA, NA, NA, NA, NA, NA, "Brose U, et al. 2…
naniar::miss_var_summary(homerange)
## # A tibble: 24 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 alternative.mass.reference 561 98.6
## 2 preymass 502 88.2
## 3 log10.preymass 502 88.2
## 4 PPMR 502 88.2
## 5 prey.size.reference 502 88.2
## 6 N 375 65.9
## 7 primarymethod 1 0.176
## 8 taxon 0 0
## 9 common.name 0 0
## 10 class 0 0
## # … with 14 more rows
Scatter plots are good at revealing relationships that are not readily visible in the raw data. For now, we will not add regression lines or calculate any r2 values.
In the case below, let’s explore whether or not there is a relationship between animal mass and home range. We are using the log transformed values because there is a large difference in mass and home range among the different species in the data.
As we go through each of these steps, quiz yourself on what you remember.
Let’s first grab our variables. What’s a command we can use to know what they are called?
names(homerange)
## [1] "taxon" "common.name"
## [3] "class" "order"
## [5] "family" "genus"
## [7] "species" "primarymethod"
## [9] "N" "mean.mass.g"
## [11] "log10.mass" "alternative.mass.reference"
## [13] "mean.hra.m2" "log10.hra"
## [15] "hra.reference" "realm"
## [17] "thermoregulation" "locomotion"
## [19] "trophic.guild" "dimension"
## [21] "preymass" "log10.preymass"
## [23] "PPMR" "prey.size.reference"
Remember that big datasets have data that may overlap. We used geom_jitter() before to add some random noise to the data and separating some of the individual points. Let’s write a plot for this.
ggplot(data = homerange, mapping = aes(x = log10.mass, y = log10.hra)) +
geom_jitter()
And let’s add a best fit line. What did we change that added the line?
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
geom_point()+
geom_smooth(method=lm, se=T) #adds the regression line, `se=TRUE` will add standard error
## `geom_smooth()` using formula 'y ~ x'
Let’s take a look at the variable trophic.guild. We’ll repeat some of the steps from before in barplotting, and then we’ll practice plotting new things!
We will first specify an x-axis and a y-axis.
homerange %>%
filter(family=="salmonidae") %>%
ggplot(aes(x=common.name, y=log10.mass))+
geom_col()
Take a look at the graph. What’s that line that says ‘filter’ probably doing?
Let’s plot another one. Take a look at the graph and trace back what the code was asking.
homerange %>%
group_by(class) %>%
summarize(mean_body_wt=mean(log10.mass)) %>%
ggplot(aes(x=class, y=mean_body_wt))+
geom_col()
## `summarise()` ungrouping output (override with `.groups` argument)
homerange data to include mammals only.homerange %>%
filter(class=="mammalia")
## # A tibble: 238 x 24
## taxon common.name class order family genus species primarymethod N
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 mamm… giant gold… mamm… afro… chrys… chry… trevel… telemetry* <NA>
## 2 mamm… Grant's go… mamm… afro… chrys… erem… granti telemetry* <NA>
## 3 mamm… pronghorn mamm… arti… antil… anti… americ… telemetry* <NA>
## 4 mamm… impala mamm… arti… bovid… aepy… melamp… telemetry* <NA>
## 5 mamm… hartebeest mamm… arti… bovid… alce… busela… telemetry* <NA>
## 6 mamm… barbary sh… mamm… arti… bovid… ammo… lervia telemetry* <NA>
## 7 mamm… American b… mamm… arti… bovid… bison bison telemetry* <NA>
## 8 mamm… European b… mamm… arti… bovid… bison bonasus telemetry* <NA>
## 9 mamm… goat mamm… arti… bovid… capra hircus telemetry* <NA>
## 10 mamm… Spanish ib… mamm… arti… bovid… capra pyrena… telemetry* <NA>
## # … with 228 more rows, and 15 more variables: mean.mass.g <dbl>,
## # log10.mass <dbl>, alternative.mass.reference <chr>, mean.hra.m2 <dbl>,
## # log10.hra <dbl>, hra.reference <chr>, realm <chr>, thermoregulation <chr>,
## # locomotion <chr>, trophic.guild <chr>, dimension <chr>, preymass <dbl>,
## # log10.preymass <dbl>, PPMR <dbl>, prey.size.reference <chr>
homerange %>%
filter(class=="mammalia") %>%
group_by(trophic.guild) %>%
count()
## # A tibble: 2 x 2
## # Groups: trophic.guild [2]
## trophic.guild n
## <chr> <int>
## 1 carnivore 80
## 2 herbivore 158
homerange %>%
filter(class=="mammalia") %>%
group_by(trophic.guild) %>%
count() %>%
ggplot(aes(y=trophic.guild, x=n))+
geom_col()
Boxplots help us visualize a range of values. So, on the x-axis we typically have something categorical and the y-axis is the range. In the case below, we are plotting log10.mass by taxonomic class in the homerange data. geom_boxplot() is the geom type for a standard box plot. The center line in each box represents the median, not the mean.
Let’s look at the variable log10.mass grouped by taxonomic class.
homerange %>%
group_by(class) %>%
summarize(min_log10.mass=min(log10.mass),
max_log10.mass=max(log10.mass),
median_log10.mass=median(log10.mass))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 4
## class min_log10.mass max_log10.mass median_log10.mass
## <chr> <dbl> <dbl> <dbl>
## 1 actinopterygii -0.658 3.55 2.08
## 2 aves 0.712 4.95 1.82
## 3 mammalia 0.620 6.60 3.33
## 4 reptilia 0.539 4.03 2.51
homerange %>%
ggplot(aes(x = class, y = log10.mass)) +
geom_boxplot()
log10.mass.homerange %>%
group_by(trophic.guild) %>%
summarize(min_log10.mass=min(log10.mass),
max_log10.mass=max(log10.mass),
mean_log10.mass=mean(log10.mass),
median_log10.mass=median(log10.mass),
total_n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 6
## trophic.guild min_log10.mass max_log10.mass mean_log10.mass median_log10.ma…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 carnivore -0.658 5.05 2.24 2.28
## 2 herbivore 0.398 6.60 3.13 2.97
## # … with 1 more variable: total_n <int>
homerange %>%
ggplot(aes(x=trophic.guild, y=log10.mass))+
geom_boxplot()
homerange %>%
filter(taxon=="mammals" & trophic.guild=="carnivore") %>%
summarize(min_log10.mass=min(log10.mass),
max_log10.mass=max(log10.mass),
mean_log10.mass=mean(log10.mass),
median_log10.mass=median(log10.mass),
total_n=n()) %>%
pivot_longer(cols=everything(),
names_to="measurement",
values_to="value")
## # A tibble: 5 x 2
## measurement value
## <chr> <dbl>
## 1 min_log10.mass 0.620
## 2 max_log10.mass 5.05
## 3 mean_log10.mass 3.03
## 4 median_log10.mass 3.29
## 5 total_n 80
homerange %>%
filter(taxon=="mammals" & trophic.guild=="carnivore") %>%
select(family, log10.mass)
## # A tibble: 80 x 2
## family log10.mass
## <chr> <dbl>
## 1 chrysochloridae 2.64
## 2 chrysochloridae 1.36
## 3 canidae 3.70
## 4 canidae 4.44
## 5 canidae 3.87
## 6 canidae 3.60
## 7 canidae 3.65
## 8 canidae 3.51
## 9 canidae 3.32
## 10 eupleridae 3.98
## # … with 70 more rows
homerange %>%
filter(taxon=="mammals" & trophic.guild=="carnivore") %>%
select(family, log10.mass) %>%
ggplot(aes(x=family, y=log10.mass))+
geom_boxplot()+
coord_flip()+
labs(title="Family vs. Log10.mass")
Now that we have practiced scatter plots, bar plots, and box plots we need to learn how to adjust their appearance to suit our needs. Let’s start with labeling x and y axes.
Let’s revisit the plot of log10.mass and log10.hra.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
geom_point()+
geom_smooth(method=lm, se=T)
## `geom_smooth()` using formula 'y ~ x'
The plot looks clean, but it is incomplete. A reader unfamiliar with the data might have a difficult time interpreting the labels. To add custom labels, we use the labs command.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
geom_point()+
geom_smooth(method=lm, se=T)+
labs(title="Log 10 Mass vs. Log 10 Home Range Area",
x="Log 10 Mass (g)",
y="Log 10 Home Range Area (m2)")
## `geom_smooth()` using formula 'y ~ x'
We can improve the plot further by adjusting the size and face of the text. We do this using theme().
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
geom_point()+
geom_smooth(method=lm, se=T)+
labs(title="Log 10 Mass vs. Log 10 Home Range Area",
x="Log 10 Mass (g)",
y="Log 10 Home Range Area (m2)") +
theme(plot.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 10))
## `geom_smooth()` using formula 'y ~ x'
The rel() option changes the relative size of the title to keep things consistent. Adding hjust allows control of title position.
ggplot(data=homerange, mapping=aes(x=log10.mass, y=log10.hra)) +
geom_point()+
geom_smooth(method=lm, se=T)+
labs(title="Log 10 Mass vs. Log 10 Home Range Area",
x="Log 10 Mass",
y="Log 10 Home Range Area (m2)") +
theme(plot.title = element_text(size = rel(1.5), hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
There are lots of options for aesthetics. An aesthetic can be assigned to either numeric or categorical data. fill is a common grouping option; notice that an appropriate key is displayed when you use one of these options.
ggplot(data=homerange, aes(x=realm, fill=realm))+geom_bar()+
labs(title = "# Species by Realm",
x = "realm",
y = NULL) +
theme(plot.title = element_text(size = rel(1.5), hjust = 0.5))
size adjusts the size of points relative to a continuous variable.
ggplot(data=homerange, aes(x=log10.hra, y=log10.mass, size=log10.hra))+
geom_point(na.rm=T)
There are many options to create nice plots in ggplot. One useful trick is to store the plot as a new object and then experiment with geom’s and aesthetics.
p <- homerange %>%
ggplot(aes(x= log10.mass, y= log10.hra))
Play with point size.
p +geom_point(size=1)
Map shapes to another categorical variable.
p+geom_point(aes(shape=thermoregulation, color=thermoregulation), size=1.75)
At this point you should be comfortable building bar plots that shows counts of observations using geom_bar(). Last time we explored the fill option in box plots as a way to bring color to the plot; i.e. we filled by the same variable that we were plotting. What happens when we fill by a different categorical variable?
Let’s start by counting how many observations we have in each taxonomic group.
homerange %>% count(taxon)
## # A tibble: 9 x 2
## taxon n
## <chr> <int>
## 1 birds 140
## 2 lake fishes 9
## 3 lizards 11
## 4 mammals 238
## 5 marine fishes 90
## 6 river fishes 14
## 7 snakes 41
## 8 tortoises 12
## 9 turtles 14
Now let’s make a bar plot of these data.
homerange %>%
ggplot(aes(x = taxon)) + geom_bar() +
coord_flip() +
labs(title = "Observations by Taxon in Homerange Data",
x = "Taxonomic Group")
By specifying fill=trophic.guild we build a stacked bar plot that shows the proportion of a given taxonomic group that is an herbivore or carnivore.
homerange %>%
ggplot(aes(x = taxon, fill = trophic.guild)) + geom_bar() +
coord_flip() +
labs(title = "Observations by Taxon in Homerange Data",
x = "Taxonomic Group",
fill = "Trophic Guild")
We can also have counts of each trophic guild within taxonomic group shown side-by-side by specifying position="dodge".
homerange %>%
ggplot(aes(x = taxon, fill = trophic.guild)) + geom_bar(position = "dodge") +
coord_flip() +
labs(title = "Observations by Taxon in Homerange Data",
x = "Taxonomic Group",
fill = "Trophic Guild")
Here is the same plot oriented vertically.
homerange %>%
ggplot(aes(x = taxon, fill = trophic.guild)) +
geom_bar(position = "dodge") +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
labs(title = "Observations by Taxon in Homerange Data",
x = "Taxonomic Group",
fill= "Trophic Guild")
We can also scale all bars to a percentage (or proportion).
homerange %>%
ggplot(aes(x = taxon, fill = trophic.guild))+
geom_bar(position = position_fill())+
scale_y_continuous(labels = scales::percent)+
coord_flip()